Trying out DMM clustering using this workflow (https://microbiome.github.io/tutorials/DMM.html).
library(here)
## here() starts at /Users/bowiek/Library/CloudStorage/OneDrive-OregonHealth&ScienceUniversity/MrOS
library(phyloseq)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(microshades)
library(patchwork)
library(ggpubr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(microbiome)
##
## microbiome R package (microbiome.github.com)
##
##
##
## Copyright (C) 2011-2022 Leo Lahti,
## Sudarshan Shetty et al. <microbiome.github.io>
##
## Attaching package: 'microbiome'
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:base':
##
## transform
library(DirichletMultinomial)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:tidyr':
##
## expand
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:microbiome':
##
## coverage
## The following object is masked from 'package:rstatix':
##
## desc
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:phyloseq':
##
## distance
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
This ps object was cleaned up in Box/Kate/MrOS/data/Analyses/MrOS_EDA.Rmd. All of the data is described explicitly in that Rmd. The ps object is rarefied.
Code from Erin Dahl, created 04/05/22 and sent to me in an email, labeled DMM_MrOs.html.
load(here("data", "ps_subcat.RData"))
# Creating a variable with an untouched ps object
ps_orig <- ps
ps <- tax_glom(ps, "Genus")
# To speed up, only consider the core taxa
# that are prevalent at 0.01% relative abundance in 10% of the samples
# (note that this is not strictly correct as information is
# being discarded; one alternative would be to aggregate rare taxa)
ps_comp <- microbiome::transform(ps_orig, "compositional") # convert all abundances to relative abundances
taxa <- core_members(ps_comp, detection = 0.01/100, prevalence = 10/100) # only keep taxa at >= 0.01% relative abundance in at least 10% of samples
ps_sub <- prune_taxa(taxa, ps)
taxa_names(ps_sub) <- data.frame(ps_sub@tax_table)$Genus # relabel taxa with their Genus names
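One thing to watch with this renaming step: `taxa_names()` must be unique, and after `tax_glom()` the `Genus` column can still contain NA or repeated labels. A defensive sketch (a hypothetical guard, not part of the original workflow):

```r
# Deduplicate Genus labels before using them as taxa names (hypothetical guard).
genus <- data.frame(tax_table(ps_sub))$Genus
if (anyDuplicated(genus) > 0) {
  genus <- make.unique(as.character(genus)) # appends .1, .2, ... to repeats
}
taxa_names(ps_sub) <- genus
```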
# Pick the OTU count matrix
# and convert it into samples x taxa format
dat <- abundances(ps_sub)
count <- as.matrix(t(dat))
# remove the five rows with 0 sum
count_fixed <- count[rowSums(count) >0,]
count_removed <- count[rowSums(count) == 0,]
#Samples removed
rownames(count_removed)
## NULL
Starts with 571 taxa, but goes down to 34 once restricted to core taxa (prevalence = 20/100).
With prevalence = 10/100, it goes down to 54 taxa.
Setting this seed should keep the model fits consistent no matter how many times they are run. There are slight variations between runs, but the results are largely the same.
fit <- lapply(1:12, dmn, seed = 120821, count = count_fixed, verbose = TRUE)
## dmn, k=1
## Soft kmeans
## Expectation Maximization setup
## Expectation Maximization
## Hessian
## dmn, k=2
## Soft kmeans
## iteration 10 change 0.006313
## iteration 20 change 0.003247
## iteration 30 change 0.010098
## iteration 40 change 0.016607
## iteration 50 change 0.000195
## iteration 60 change 0.000001
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 3.537368
## iteration 20 change 9.402401
## iteration 30 change 0.000003
## Hessian
## dmn, k=3
## Soft kmeans
## iteration 10 change 0.016840
## iteration 20 change 0.001601
## iteration 30 change 0.000007
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 1.609024
## iteration 20 change 0.000478
## iteration 30 change 0.000001
## Hessian
## dmn, k=4
## Soft kmeans
## iteration 10 change 0.028317
## iteration 20 change 0.000251
## iteration 30 change 0.000005
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 3.994406
## iteration 20 change 5.707279
## iteration 30 change 0.121941
## iteration 40 change 0.000081
## Hessian
## dmn, k=5
## Soft kmeans
## iteration 10 change 0.016981
## iteration 20 change 0.000167
## iteration 30 change 0.000001
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 4.542790
## iteration 20 change 6.254182
## iteration 30 change 0.026815
## iteration 40 change 0.000031
## Hessian
## dmn, k=6
## Soft kmeans
## iteration 10 change 0.024778
## iteration 20 change 0.000042
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 13.296472
## iteration 20 change 0.316766
## iteration 30 change 0.194911
## iteration 40 change 1.406707
## iteration 50 change 7.190685
## iteration 60 change 0.046072
## iteration 70 change 0.000835
## iteration 80 change 0.000156
## iteration 90 change 0.000069
## iteration 100 change 0.000047
## Hessian
## dmn, k=7
## Soft kmeans
## iteration 10 change 0.018533
## iteration 20 change 0.041837
## iteration 30 change 0.000038
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 12.245453
## iteration 20 change 14.909620
## iteration 30 change 5.048033
## iteration 40 change 0.260315
## iteration 50 change 0.223842
## iteration 60 change 0.238205
## iteration 70 change 0.069831
## iteration 80 change 0.008170
## iteration 90 change 0.019363
## iteration 100 change 0.001550
## Hessian
## dmn, k=8
## Soft kmeans
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 7.511017
## iteration 20 change 0.628649
## iteration 30 change 0.704308
## iteration 40 change 0.524228
## iteration 50 change 0.035005
## iteration 60 change 0.127707
## iteration 70 change 0.004504
## iteration 80 change 0.003741
## iteration 90 change 0.001009
## iteration 100 change 0.000106
## Hessian
## dmn, k=9
## Soft kmeans
## iteration 10 change 0.051147
## iteration 20 change 0.023488
## iteration 30 change 0.000211
## iteration 40 change 0.000002
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 27.770269
## iteration 20 change 17.282921
## iteration 30 change 0.485548
## iteration 40 change 0.009255
## iteration 50 change 0.000088
## iteration 60 change 0.000001
## Hessian
## dmn, k=10
## Soft kmeans
## iteration 10 change 0.079910
## iteration 20 change 0.001438
## iteration 30 change 0.000040
## iteration 40 change 0.000001
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 19.629779
## iteration 20 change 1.011632
## iteration 30 change 0.216636
## iteration 40 change 0.000073
## Hessian
## dmn, k=11
## Soft kmeans
## iteration 10 change 0.013396
## iteration 20 change 0.000629
## iteration 30 change 0.000007
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 29.439336
## iteration 20 change 2.329879
## iteration 30 change 0.804719
## iteration 40 change 0.141667
## iteration 50 change 5.217509
## iteration 60 change 2.807502
## iteration 70 change 0.473148
## iteration 80 change 0.026707
## iteration 90 change 0.081518
## iteration 100 change 0.000103
## Hessian
## dmn, k=12
## Soft kmeans
## iteration 10 change 0.017565
## iteration 20 change 0.000485
## iteration 30 change 0.000205
## iteration 40 change 0.000076
## iteration 50 change 0.000025
## iteration 60 change 0.000008
## iteration 70 change 0.000003
## Expectation Maximization setup
## Expectation Maximization
## iteration 10 change 16.471662
## iteration 20 change 44.193673
## iteration 30 change 3.532473
## iteration 40 change 0.178164
## iteration 50 change 0.104794
## iteration 60 change 1.252516
## iteration 70 change 0.024182
## iteration 80 change 0.022576
## iteration 90 change 0.091442
## iteration 100 change 0.021216
## Hessian
# runif(1, 0, .Machine$integer.max)
# Methods for evaluating the fit of the model
lplc <- base::sapply(fit, DirichletMultinomial::laplace) # Laplace
aic <- base::sapply(fit, DirichletMultinomial::AIC) # AIC
bic <- base::sapply(fit, DirichletMultinomial::BIC) # BIC
# Plot model fit evaluation
plot(lplc, type="b", xlab="Number of Dirichlet Components", ylab="Model Fit")
lines(aic, type="b", lty = 2)
lines(bic, type="b", lty = 3)
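The three fit lines are hard to tell apart without a key; a legend matching the `lty` values above can be added (base-graphics sketch):

```r
# Label the three model-fit criteria plotted above
# (solid = Laplace, dashed = AIC, dotted = BIC).
legend("topright", legend = c("Laplace", "AIC", "BIC"),
       lty = 1:3, pch = 1, bty = "n")
```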
Lisa instructed me to look at clusters 4-12. According to this diagram,
k = 8 is at the minimum, meaning it gives the best clustering. However,
according to Paul Spellman, the fewer clusters the better, and he typically
looks at where the difference approaches 0, so clusters 4-6.
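As a cross-check on the eyeballed minimum, the best k under each criterion can be read off directly from the vectors computed above (a sketch; it assumes `lplc`, `aic`, and `bic` are still in scope):

```r
# Index of the minimum of each fit criterion = best number of components.
best_k <- c(laplace = which.min(lplc),
            aic = which.min(aic),
            bic = which.min(bic))
best_k
```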
fit_4 <- fit[[4]]
fit_5 <- fit[[5]]
fit_6 <- fit[[6]]
fit_7 <- fit[[7]]
fit_8 <- fit[[8]]
fit_9 <- fit[[9]]
fit_10 <- fit[[10]]
fit_11 <- fit[[11]]
fit_12 <- fit[[12]]
assignments_4 <- data.frame(apply(mixture(fit_4), 1, which.max))
colnames(assignments_4) <- "k_4"
assignments_5 <- data.frame(apply(mixture(fit_5), 1, which.max))
colnames(assignments_5) <- "k_5"
assignments_6 <- data.frame(apply(mixture(fit_6), 1, which.max))
colnames(assignments_6) <- "k_6"
assignments_7 <- data.frame(apply(mixture(fit_7), 1, which.max))
colnames(assignments_7) <- "k_7"
assignments_8 <- data.frame(apply(mixture(fit_8), 1, which.max))
colnames(assignments_8) <- "k_8"
assignments_9 <- data.frame(apply(mixture(fit_9), 1, which.max))
colnames(assignments_9) <- "k_9"
assignments_10 <- data.frame(apply(mixture(fit_10), 1, which.max))
colnames(assignments_10) <- "k_10"
assignments_11 <- data.frame(apply(mixture(fit_11), 1, which.max))
colnames(assignments_11) <- "k_11"
assignments_12 <- data.frame(apply(mixture(fit_12), 1, which.max))
colnames(assignments_12) <- "k_12"
dmm_classification <- merge(assignments_4, assignments_5, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_6, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_7, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_8, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_9, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_10, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_11, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
dmm_classification <- merge(dmm_classification, assignments_12, by = "row.names")
rownames(dmm_classification)<- dmm_classification$Row.names
dmm_classification$Row.names <- NULL
temp_sam <- data.frame(ps_orig@sam_data)
new_sam <- merge(temp_sam, dmm_classification, by = "row.names", all=TRUE)
rownames(new_sam) <- new_sam$Row.names
new_sam$Row.names <- NULL
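The repeated extract-and-merge blocks above could be collapsed into a short loop; a sketch under the assumption that `fit` and the `k_4` through `k_12` column naming from above are kept:

```r
# Hard-assign each sample for k = 4..12 and merge all columns by sample name.
assignment_list <- lapply(4:12, function(k) {
  a <- data.frame(apply(mixture(fit[[k]]), 1, which.max))
  colnames(a) <- paste0("k_", k)
  a
})
dmm_classification <- Reduce(function(x, y) {
  m <- merge(x, y, by = "row.names")
  rownames(m) <- m$Row.names
  m$Row.names <- NULL
  m
}, assignment_list)
```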
sample_data(ps_orig) <- new_sam
mdf <- ps_orig %>%
  psmelt()
## Warning in psmelt(.): The sample variables:
## Sample
## have been renamed to:
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
color_dfs <- create_color_dfs(mdf, group_level = "Phylum", subgroup_level = "Genus")
color_dfs <- reorder_samples_by(color_dfs$mdf,color_dfs$cdf, order_tax = "Firmicutes")
plot_g <- plot_microshades(color_dfs$mdf, color_dfs$cdf)
## Warning in is.na(cdf$hex) || is.na(cdf$group): 'length(x) = 25 > 1' in coercion
## to 'logical(1)'
## Warning in is.na(cdf$hex) || is.na(cdf$group): 'length(x) = 25 > 1' in coercion
## to 'logical(1)'
plot_g + facet_grid(~k_4, scales = "free_x")
#temp <- data.frame(ps@tax_table)
Overall it looks like the clusters are similar in size, although cluster 1 is maybe a little smaller. Cluster 1 is primarily Firmicutes (staph), cluster 2 is Firmicutes (other) mixed with Actinobacteria, cluster 3 is a mixture of several different phyla, and the last cluster is mainly Proteobacteria (Neisseria). Cluster 3 is a little hard to define, but is probably a mixed cluster.
plot_g + facet_grid(~k_5, scales = "free_x")
The first two clusters look similar to the k=4 graph above. Cluster 3 is
interesting because it's primarily Proteobacteria and has the least
Firmicutes. Cluster 4, on the other hand, has the most Firmicutes
and seems to be mixed with Actinobacteria. Cluster 5 is primarily
Firmicutes-other and Bacteroidetes. I definitely like these groupings better
than k=4.
plot_g + facet_grid(~k_6, scales = "free_x")
According to the fit graph, this is the second-best clustering, with k=8
being the best. The first two clusters look similar. There's still a
Proteobacteria-dominated cluster (4) and a Bacteroidetes cluster, which
also has "other" Firmicutes (3). What's added is cluster 6, which is
dominated by E. coli but made up of only 6 samples… which I'm
kinda eh about.
mixturewt(fit_6)

| pi | theta |
|---|---|
| 0.2974321 | 2.253109 |
| 0.2018315 | 6.500637 |
| 0.1829721 | 20.323644 |
| 0.1688639 | 9.880016 |
| 0.1321411 | 10.400118 |
| 0.0167593 | 24.833727 |
I added the pi/theta measurements here. Pi is the proportion of samples assigned to each cluster; theta describes how tightly the samples within a cluster resemble each other (higher theta = more homogeneous cluster).
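As a sanity check on that reading of pi, the mixture weights can be compared against the empirical share of samples hard-assigned to each cluster (a sketch using `fit_6` from above):

```r
# pi from the fitted model vs. the observed fraction of samples per cluster.
wt <- mixturewt(fit_6)
hard <- apply(mixture(fit_6), 1, which.max)
empirical <- as.numeric(table(factor(hard, levels = seq_len(nrow(wt))))) /
  length(hard)
round(cbind(pi = wt$pi, empirical = empirical), 3)
```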
plot_g + facet_grid(~k_7, scales = "free_x")
This is interesting, because according to the fit eval, the fit gets
slightly worse with 7 clusters. It also has two of those smaller
clusters that are dominated by specific genera: one by E. coli, and a
Bacteroidetes one that also has mostly Firmicutes-other.
mixturewt(fit_7)

| pi | theta |
|---|---|
| 0.3531051 | 2.489032 |
| 0.2043463 | 8.675721 |
| 0.1502068 | 9.878811 |
| 0.1382811 | 9.552434 |
| 0.1225539 | 32.661364 |
| 0.0167699 | 24.832315 |
| 0.0147370 | 51.866208 |
plot_g + facet_grid(~k_8, scales = "free_x") + theme(legend.position = "bottom", axis.text.x = element_blank(), legend.title = element_blank(), legend.text = element_text(size = 12))
According to the fit eval graph, this is the "optimal" number of
clusters. The last two clusters do not seem to have as many samples as
the others; however, they look very similar. I find cluster 6 interesting
because it has very little Firmicutes, and cluster 5 because it has a
lot of Firmicutes, but specifically other genera (I wonder if
the genera are the same, but we can't actually visualize it; I suppose
this is where contribution plots will come in handy).
mixturewt(fit_8)

| pi | theta |
|---|---|
| 0.2649356 | 2.067840 |
| 0.1512581 | 6.996047 |
| 0.1364068 | 9.222521 |
| 0.1265088 | 7.127429 |
| 0.1187904 | 10.097499 |
| 0.0903646 | 15.676812 |
| 0.0612442 | 79.621443 |
| 0.0504915 | 16.303491 |
plot_g + facet_grid(~k_9, scales = "free_x")
Here we get back to clusters with very very few samples (clusters 7-9),
which I think are kind of funky and I'm not into them, although they do
have a very high theta (a measure of within-cluster similarity).
mixturewt(fit_9)

| pi | theta |
|---|---|
| 0.2438355 | 2.050104 |
| 0.1728209 | 7.056511 |
| 0.1465091 | 9.423708 |
| 0.1331145 | 10.218135 |
| 0.1319222 | 10.700293 |
| 0.1214418 | 32.825735 |
| 0.0188243 | 7.316058 |
| 0.0167960 | 24.806029 |
| 0.0147357 | 50.773428 |
plot_g + facet_grid(~k_10, scales = "free_x")
I think these clusters are getting small, and the very specific clusters
of <10 samples may make for some wonky analyses.
plot_g + facet_grid(~k_11, scales = "free_x")
Getting into very small clusters and I don’t think we should consider
this one.
plot_g + facet_grid(~k_12, scales = "free_x")
These clusters are very very small, so I don’t think this is worth
analyzing.
rm(list = ls()[! ls() %in% c("ps_orig", "new_sam")])
#save.image(here("data", "ps_dmm_6.RData"))